Unexpected Productions May Well be Errors

Authors

  • Tylman Ule
  • Kiril Ivanov Simov
Abstract

We present a method for detecting annotation errors in treebanks. It assumes that errors are unexpected small tree fragments. We generate statistics over configurations of these fragments using a standard statistical test. We then use the test results and the characteristics of the fragments' distributions as features for a machine-learning classifier that labels unseen configurations as likely errors. Evaluation shows that the resulting list of error candidates is reliable, independent of corpus size, annotation quality, and target language.

Setting up language resources involves considerable effort, because human intervention is inevitable and costly. Human annotators are essential, because they usually outperform automatic methods in annotation accuracy, but they still make their own kinds of errors. In addition to genuine mistakes, they do not always behave identically when presented with the same infrequent problem. Thus one can expect a number of errors to be present in any hand-built language resource.

We divide these errors into two categories: violations of the annotation guidelines, and violations of language principles not covered by the annotation guidelines. Additionally, following Blaheta (2002), errors can also be:

  • detectable: errors that are easy to spot and fix by using queries over the annotation that define impossible configurations and transformations for correction;
  • fixable: errors that can be found automatically, but that require human intervention for correction;
  • systematic inconsistencies: errors that are not covered by the annotation guidelines, or that are not described precisely enough in the guidelines.
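The idea of a query that defines an impossible configuration can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the nested-tuple tree encoding, the helper names `productions` and `find_violations`, and the toy rule that a PP must contain a preposition (`IN`). None of it comes from the paper or from any real annotation guideline.

```python
# Trees are nested tuples: (label, child, child, ...).
# A leaf is (tag, word). This encoding is an assumption for the sketch.

def is_leaf(node):
    """A leaf pairs a part-of-speech tag with a word string."""
    return len(node) == 2 and isinstance(node[1], str)

def productions(tree):
    """Yield (parent_label, tuple_of_child_labels) for every internal node."""
    if is_leaf(tree):
        return
    label, children = tree[0], tree[1:]
    yield label, tuple(child[0] for child in children)
    for child in children:
        yield from productions(child)

def pp_without_preposition(parent, children):
    """Hypothetical guideline: every PP must contain a preposition (IN)."""
    return parent == "PP" and "IN" not in children

def find_violations(trees, rule):
    """Return (tree_index, parent, children) for each production matching the rule."""
    return [(index, parent, children)
            for index, tree in enumerate(trees)
            for parent, children in productions(tree)
            if rule(parent, children)]

good = ("S",
        ("NP", ("DT", "the"), ("NN", "dog")),
        ("VP", ("VBZ", "sleeps"),
               ("PP", ("IN", "in"), ("NP", ("DT", "the"), ("NN", "house")))))
bad = ("S",
       ("NP", ("NN", "dogs")),
       ("VP", ("VBP", "sleep"), ("PP", ("NP", ("NN", "houses")))))

violations = find_violations([good, bad], pp_without_preposition)
print(violations)
```

A rule of this kind only finds what someone has already thought to forbid, which is why such checks suit errors covered by the guidelines.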
These two classifications of errors in annotated corpora are orthogonal, but not independent: we can expect errors that violate the annotation guidelines to be usually detectable and fixable, and those that violate language principles not covered by the annotation guidelines to be more frequently systematic inconsistencies. Each class of errors requires its own means of detection and correction. Detectable errors covered by the guidelines are the easiest in this respect. They can be addressed by encoding the guidelines formally and testing the corpus for consistency. Detecting the other types of errors requires additional linguistic knowledge. Such knowledge is not always available or easy to acquire, so other mechanisms for error detection are desirable.

We divide those methods into symbolic and non-symbolic approaches. The symbolic approaches are based on (linguistically motivated) pattern matching that selects possible deviations from linguistically correct occurrences. Patterns can be devised by human annotators, or they can be extracted (semi-)automatically from the corpus itself. The non-symbolic approaches use statistical methods to find rare events in the annotated corpus, where an event is a certain fragment of the annotation. In general, these methods can find errors in each of the above categories, but they are especially useful when pattern-based approaches are not easily applicable because patterns are difficult to find.

We present such a non-symbolic method that attacks errors and inconsistencies in structural annotation, and that shows good performance across languages and annotation schemes. We detect errors and inconsistencies that appear as unexpected events in a corpus using a variant of Directed Treebank Refinement (DTR; Ule, 2003) on artificially introduced errors, and we apply machine learning (ML) to produce a list of likely error candidates fully automatically.
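To make the notion of a statistically unexpected event concrete, the following is a minimal sketch and not the DTR variant or the classifier described above. The text does not name the statistical test, so the sketch assumes a log-likelihood ratio (G²) over a 2×2 table of parent and child labels. The function names, the toy data, and the threshold of 3.84 (the 5% critical value of a chi-squared distribution with one degree of freedom) are likewise assumptions chosen for illustration.

```python
import math
from collections import Counter

def g_squared(k11, k12, k21, k22):
    """Log-likelihood ratio statistic G^2 for a 2x2 contingency table."""
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    observed = ((k11, k12), (k21, k22))
    g = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # empty cells contribute nothing to the sum
                g += o * math.log(o / (rows[i] * cols[j] / total))
    return 2.0 * g

def unexpected_pairs(productions, threshold=3.84):
    """Flag (parent, child) pairs observed less often than independence predicts."""
    pair = Counter()
    for parent, children in productions:
        for child in children:
            pair[(parent, child)] += 1
    total = sum(pair.values())
    by_parent, by_child = Counter(), Counter()
    for (parent, child), n in pair.items():
        by_parent[parent] += n
        by_child[child] += n
    flagged = []
    for (parent, child), k11 in pair.items():
        k12 = by_parent[parent] - k11   # same parent, other children
        k21 = by_child[child] - k11     # same child, other parents
        k22 = total - k11 - k12 - k21   # everything else
        expected = by_parent[parent] * by_child[child] / total
        score = g_squared(k11, k12, k21, k22)
        if k11 < expected and score > threshold:
            flagged.append((parent, child, k11, expected, score))
    return sorted(flagged, key=lambda item: -item[-1])

# Toy corpus: 100 regular productions plus one injected error.
toy = ([("NP", ("DT", "NN"))] * 50
       + [("VP", ("VBZ", "NP"))] * 50
       + [("NP", ("DT", "VBZ"))])

for parent, child, count, expected, score in unexpected_pairs(toy):
    print(parent, child, count, round(expected, 2), round(score, 1))
```

In this toy corpus only the injected pair, a `VBZ` directly under `NP`, is both rarer than expected and above the threshold. With counts this small the chi-squared approximation is rough, and the method described above does not threshold a single score: it feeds test results and distribution characteristics to a learned classifier.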




Publication date: 2004